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Abstract 

' Chargaff's second parity rule (CSPR) asserts that the frequencies of short polynucleotide chains are 

the same as those of the complementary reversed chains. Up to now, this hypothesis has only been 
observed empirically and there is currently no explanation for its presence in DNA strands. Here we 
' argue that CSPR is a probabilistic consequence of the reverse complementarity between paired strands, 

because the Gibbs distribution associated with the chemical energy between the bonds satisfies CSPR. 
We develop a statistical test to study the validity of CSPR under the Gibbsian assumption and we apply 
it to a large set of bacterial genomes taken from the GenBank repository. 
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1 Introduction 



Double helical DNA is made up of two complementary polynucleotide chains, the primary and the secondary 
■ strands, each having opposing polarities. Chargaff's first parity rule is that "the numbers of A's and T's 

and the numbers of C's and G's match exactly in every DNA duplex" Q^. Chargaff's second parity rule 
(CSPR) states that this is valid when looking at a single strand, see 0, and that this happens not only for 
mononucleotides but also for short polynucleotide chains. Chargaff's first parity rule is a simple consequence 
of the double-stranded organization of genomic sequences and the chemistry of nucleic acids which only 
permits A to bond with T and C to bond with G In this work we argue that the reverse complementary 
relationship between nucleic acids on opposing strands also explains Chargaff's second parity rule when we 
assume that randomness manifest in DNA sequences is captured by the Gibbs distribution of the chemical 
energy potential. 

CSPR was first observed experimentally in Bacillus subtilis 2^ and was subsequently confirmed in suf- 
ficiently long sequences available in GenBank for small polymer chains of 3 to 6 bases [3j. More recent 
empirical studies assessing its validity can be found in [H [S] and [B] . 
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A number of possible mechanisms explaining strand symmetry have been proposed, for example, no 
strand biases for mutation and selection [71 [S] and selection of step-loop structures [9] . Further discussion 
of various mechanisms that could support the origins of this intrastrand symmetry are discussed in [10] and 
references therein. In [11] and [12] . a number of mechanisms causing violation of CSPR in short polymers 
are described. 

Here we propose that CSPR manifests due to the constraints that reverse complementarity imposes on 
the Gibbs distribution. In Section [2] we give the framework and state the results. There, the impirical 
polymer frequencies are replaced by polymer occurrence probabilities on a translation invariant probability 
distribution. We then express CSPR using this notion and prove that CSPR written in this way follows 
from the fact that energy symmetry is preserved for the Gibbsian distribution. This is done in Theorem [2] 
In Section [3] we give a characterisation of CSPR for dinucleotides, we prove an extension of the Central 
Limit Theorem for Gibbs measures to vector-valued random variables and we derive an explicit expression 
for the asymptotic covariance matrix. In Section 2] we supply a statistical test for the validity of CSPR for 
dinucleotides under the hypothesis that the nucleotides of the strand are distributed as a stationary Gibbsian 
process. We have applied the test extensively to bacterial genomes available from GenBank. The hypothesis 
of CSPR in the Gibbsian setting is confirmed for a large number of genomes. Further analysis would be 
necessary in order to determine whether genomes rejected by the test were because they fail to comply with 
CSPR, because they are not Gibbsian, or both. 



2 Chargaff 's second parity rule 
2.1 Preliminaries 

Let ^ be a finite set (alphabet) endowed with an involution T : A ^ A: P is one-to-one and P^^ = P. In 
the genomic setting A — {A, C, G, T} and P is an involution given by the complementary function T(A) — T 
and P(C) = G. 

Let X = [xj : j = 0, . . . , n— 1) G A" be the sequence of nucleotides on a strand of the genome (for bacterial 
DNA n 10^). The sequence x complies with CSPR whenever the frequencies of all short polymers agree 
with the frequencies of their reverse complements. In other words, for k small (order of 10) and all polymers 
(ao, . . . ,afc„i) e A'': 

#{j<n-k : Xj ^ap, . . . ,Xj+k-i^ak-i)} _ #{j<n-k : a:^ =r(Qfc_i), . . . , a:j+fc_i =P(ao)} 
n — k + I n — k + 1 

(Here #i? denotes the cardinality of the set B). Observe that the frequency is computed by moving a window 
of length k along the strand. 
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2.2 CSPR as a symmetric probability relation 

We will derive CSPR from the complementary relation in the thermodynamical formalism. In this theoretical 
framework the strands are modeled by bi-infinite sequences and the frequencies of a word are the probabilities 
that they appear at an arbitrary place. We restrict ourselves to translation invariant probability distributions 
that are Gibbs measures with respect to the chemical energy. 

The strands are modeled by sequences in . Thus, x = {xj : G Z) represents the primary strand in 
the sense 5' to 3' while y = {yj : j G Z) represents the complementary strand in the sense 3' to 5'. They 
are related by reverse complementarity: yj = V{x-j) for j G Z. Let us write this rule in another way. 
Let I : — > be the space reversal involution given by (I(a;))j — X-j and let F : — ?> be such 
that (r{x))j = r{xj), for x € A^, j G Z. Then, the rule of reverse complementarity may be written as 
y^ToI{x). 

A genome duplex is the pair (x, y) and we denote by '^{x, y) its chemical energy, which results from the 
interactions between the nucleotides on both strands. Since the interactions between the nucleotides are 
symmetric we assert that 

= $(y,x). (2) 

Insight into this equality may be obtained from the discussion of energy on finite pieces which appears in |13) . 
Let '^i{x[—l, /]; y[—l, I]) be the energy in the portion [— /, /] — {—I, ..-,/} of the duplex. In analogy with [13] 
Page 5, this energy can be assumed to be given by 

'^i{x[-ll]]y[-l,l]) = i^'ik- j;xk,xj)+ ^ iP' {k - j;y^j ,y^k) 

-l<j<k<l -l<j<k<l 

+ 1 E iri\k-j\;x,,y^,). (3) 

-l<j,k<l 

The first two summations are due to the interactions between sites on the same strand while the last one 
expresses the interactions between sites on opposite strands. The quantity a, is the interaction 

between the nucleotides a, b at distance r on the same strand and ip°(r; a, 6) is the interaction between the 
nucleotides a, b in opposite strands such that the distance from the site containing one to the site in front of 
the other is r (recall that y^j is in front of Xj, so the distance from site k containing Xk to the site j, which 
is in front of the site containing y~j, is |/c — The expression ^ is clearly symmetric in x and y. 

Let us express the symmetry relation ([2|) in another way. Since y = T o T{x) , the energy can be simply 
expressed as ^(x) = ^(x, F oI{x)) and the symmetric dependence ^(x^y) — "i{y,x) between the strands 
implies that satisfies the invariance property 

\/x e A^ : = ^'(r o I{x)) ; or equivalently * = * o P o I . 

Next, the set A^ is endowed with the product cr-algebra and let T : A^ — A^ be the translation operator 
given by {T{x))j = xj^i for all j G Z. Let P be a translation invariant distribution on A^, that is 

F{T-^B) ^ P{B), V measurable B C A^ . 
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In the spirit of ([TJ, we say that it satisfies CSPR if 

Va;e^^ Vfc > 1, V(ao,...,afe_i) e A'' : V{xo = ao, . . . ,Xk-i=ak-i) = V{xo = V{ak-i), . . . ,Xk-i=T{ao)) . 

(4) 

We claim that if P is a translation invariant distribution on then property ^ is equivalent to P being 
r o Z— invariant, that is, it satisfies P((r oX)^^B) = P(i?) for all measurable subsets B of A^. Indeed, from 
the equality 

roI"i{x : Xj ^ aj, . . . ,Xk ^ flfe} = {x : x^k ^ r(afe), . . . ,x^j = r(aj)} (5) 

taken together with the translation invariance property, one can show that if P is F o Z-invariant then ^ 
holds. Conversely, the same translation invariance property combined with equality ([5]) may be used to prove 
that (U) implies P*((r o I)^i_B) = F{B) for all cylinders B. Caratheodory's extension theorem then shows 
that this holds for all measurable sets B and the claim follows. 

The following result derives compliance with CSPR from the symmetry of energy and will be proved in 
the next section, but in a more general framework than what is needed for dealing with genomic sequences. 
In the statement of the result we assume that the energy 5" satisfies a 6'-H61der property, which will be 
properly defined later in 

Theorem 1. Assume "i/ : ^ M. is 9-Hdlder for some 9 G (0, 1). Then, the invariance = vj; o T o I 
implies that the unique translation invariant Gibbs measure P* defined on A^ complies with CSPR, that is, 
P — P* satisfies condition (0). 

2.3 Gibbs measures and CSPR 

We employ a more general framework than the one stated in Theorem [TJ Let A he a finite alphabet 
and S = (S(a, b) : a,b £ A) be an aperiodic — 1-valued matrix. The shift of finite type defined by S 
is the set Ah — {x € A^ : = 1 Vj G Z} endowed with the metric Ag{x,z) ~ Q^i^'^)^ where 

K{x,z) = sup{fc > : Xi = Zi V|i| < fc} and 9 G (0, 1) is an arbitrary but fixed value. This metric induces 
the product topology. 

Let 9 G (0, 1) be fixed. Consider the set of Holder (continuous) functions in (Ah, Ag), 

Fe = {g G C(A'h) : \g\e < oo} where \g\e - sup ( MjLl^if^ : x,y € Xs,x ^ z] . (6) 

The linear set Fg is a Banach space when it is endowed with the norm \\g\\e = \\g\\oo + \g\e- 

Each Gibbs measure on Ah is defined by an energy function 5* G -Fe- The value "^(x) represents the energy 
of the system in state x G Afe. In the thermodynamic formalism of shifts of finite type, it has been shown 
(see Theorem 1.2 in jT3) . Pages 5-6) that there exists a unique translation invariant probability measure Pip 
that satisfies 

-in m w w 7 ^ n ^ V^B {X : Xo = Zo, ■ ■ ■ , Xk = Zk) ^ 

30 < ci < C2 < cx),pG R, V^; G AH,Vfc > : ci < ^ — _ k^s-'-^^ir^ ^ ^ < C2 ■ (7) 
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A detailed proof of this result as well as a complete exposition of this topic is given in [T3], Pages 3- 
16. In this reference, it is also proven that the constant p = is the pressure of 4* and that the 

probability measure Pip is the unique translation invariant probability measure satisfying the variational 
principle p{^) = hp^ (T) + J ^'rfP.p , where hp^ (T) is the entropy of T for the translation invariant distribution 

From now on we assume that the aperiodic matrix S also satisfies 

\fa,beA: E{a,b) = E{T{b),r{aj). (8) 

We note that S(a, b) = 1 for all a,b (z A in the genomic framework, so in this context S always satisfies 
condition Hence, Theorem [T] is a straight forward consequence of the following result. 

Theorem 2. Assume E satisfies 0). Let * e Fg. Assume that is T ol-invariant: "^{x) = ^{ToI{x)) 
for all X e Ah. Then the unique translation invariant Gibbs probability measure P^j, is T oT-invariant and 
hence complies with CSPR: 

V fc > , (zo, . . . , Zfe) e A''^^ : P^s, (x : xo = zo,. ■ ■,Xk = Zk) =^1, {x : xo = T{zk), . . . , = r(2:o)) • 

Proof. To begin, let P = Pij, denote the unique T-invariant probability measure on X-^ that satisfies ([7|). 
Define the probability measure P as P{B) — P((r o I)^^B) for all measurable sets B in A^. 
Claim 1: P is translation invariant. This can be proved as follows. Note that I^^ = I and F ^ = F, while 
F commutes with I, T and T^i. So (F o = (F o Z). We also have I o = T o I and hence 

(Fol)-! oT^i = ToFoI. 

Since P is T— invariant, it is also T"-'^ -invariant, so 

P{T-^{B)) =P((roI)-i oT"i(B)) ^P{T oT oI{B)) ^P{ToI{B)) = P((r oI)-i(B)) = P(S) . 

which yields the claim. 
Claim 2: P satisfies 

^ V w ; ^ n ~ , P (a; : xo = zo, . . . , Xfe = Zfe) 
d < Ci < C2 < oo Vz e A= V fe > : Ci < < C2 . 

Note that once this claim has been shown, the result will immediately follow because uniqueness of P implies 
P = P, and so P is FoZ-invariant. To prove the claim, first observe that since F~^ — F and P is T-invariant, 

F{x : xq = zq,. . . ,Xk ^ Zk) ^P {x : (F(Z(x)))o = zq, . . . , {T{I{x)))k = Zk) 
= V{x : T{xo) = zo, . . . , T{x-k) = Zk) = P {x : xq = F(zo), . . . , X-k = r(zfe)) 
= P{x : xq = F(zfe), ...,xk= F(zo)) ^P{x : xq ^ (F oZ(z))_fc, . . . , Xfe = (FoZ(z))o) 
= P{x:xo = (r-'=(rol(z)))o,...,xfc = {T-''{ToI{z)))k) . 
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On the other hand, from the equahty T* '^(F o T(z)) = F o I[T^ '(^)) ^''^d using the fact that is 

F o I- invariant, we obtain 

fe-1 fe-1 fe-1 fe-1 

^■^{T'T-^^oI{T-^z))) = ^'J'(roI(T'=-*-iz)) = ^-^{T^-'-^z) = ^«'(r'(z)). 

i=0 4=0 4=0 4=0 

Hence 

P : xo = 00, . . . , a;fe = zfe) _ P (a: : xo = (T-''^(r o X(z)))o, • . • , Xfc = (r-^(r o I(z)))fe) 



e 

We note that 



e-pfc+E-ro'*(T*T-i(z)) 
Vz G A"^ V fc > : 5i < 7-i < 52 , 



with Si = e™" and C2 = e'"^'^ Then, 



(9) 



ci < r-i = < C2 . (10) 

Hence from ([9]), ([7]) and (fTO]). we deduce that Claim 2 holds. 

Hence, P = P and the proof is complete. □ 



3 CSPR for dinucleotides 

3.1 A 5-dimensional characterisation of CSPR 

Henceforth, we shall focus on the dinucleotide distributions under CSPR. Let P be a translation invariant 
distribution on J^. As stated, CSPR means that for all i? > 1, we have 

V (ao, . . . , a_R_i) e A"^ : T{x : xq = ao, . . . , xr-i = aR^i) = P(x : xq = T{aR-i), xr-i = T{ao)) ■ 

(11) 

If the set of equalities PT|) holds for some R = Rq, we say that CSPR holds for Rq. In this case, by taking 
appropriate marginals, the equalities also hold for all positive integers R < Ra- 

Now, for discussing CSPR for i? = 2, it is convenient to introduce the following notation. Let [ab]k be 
the event {x : Xk — a, Xfe+i = b}. Since P is translation invariant, we have P([a6]fe) — P([a6]o) for all k E Z 
and a,b E A. Therefore, CSPR for R = 2 reduces to 

ya,beA: ¥{[ab]o) - P([r(6)r(a)]o). (12) 

This equality implies CSPR for R ^ 1: P([a]o) = P([r(a)]o) for a e A, where [a]o = {x : xq = a}. 

We want to test the hypothesis Hq: CSPR holds for R = 2. In order to construct such a test, it is useful 
to introduce the following quantities: 

/ = if {a, b) : {a, b) E A^) where /(a, b) P([a6]o) - P([r(6)r(a)]o) . (13) 
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From (Hg), CSPR for i? = 2 is satisfied if and only if / = 0. 

We remark that 4 of the above 16 quantities /(a, b) vanish. More precisely, whenever (a, b) — (c, r(c)) for 
some c (z A, we see that /(a, b) = 0. Moreover, among the remaining 12 terms, only 5 are meaningful since 
f{a,b) = — /(r(6), r(a)) for any a,b £ A, and J2ceA ■fi^^'^) ^ Sce^i /("^j a G ^. In the following, 
we fix an index set /C = {{A, A), {A, C), {A, G), (C, A), (C, C)} for 5 of these values and gather them together 
into a vector /'^ {f{a,b) : {a,b) G /C). Using this alternative representation, the null hypothesis Hq is 
satisfied if and only if /'^ ~ 0. 

3.2 Covariances and the Central Limit Theorem 

Here, we present some results that are essential for developing an asymptotic test for the hypothesis Hq: 
CSPR holds for i? = 2, in the setting of Gibbs distributions. 

Let P = be Gibbsian for some 4' G Fg, with 9 G (0, 1) fixed. We begin by giving a simple computation. 
Let E = E,p denote the expectation operator associated with P*. A function g ^ Fe is said to be of zero 
mean if K{g) = 0. 

In this section we assume (p^ , ...,1^3' are zero mean functions in Fg and set (p = {ip^ , . . . , (p'). We shall 
consider :— ip'^ o for i > and k — 1, . . . ,1, and define for n> 1, 

-. n — 1 

^« - 7^ E • (14) 

Proposition 3. The limits 

Y,'^{k,j)= lim E,i,(S'^ S"^) exist for all k,j £ {1, ... ,1} and 

00 00 

j) = E* (X^- X^) + ^ E* [X^ Xi) + ^ E* ) . (15) 

1=1 i=l 

The matrix T,^ — (E'^(fc, j) : fc, G {1, ...,/}) is .symmetric and semi-positive definite. 

Moreover the convergence of the two summations on the right-hand side of Iil5\) occurs at a geometric 
rate, more precisely, 

3l<c5o,ee(0,l), Vfc,J■ = l,...,/,V^>l : ¥.^{X^Xl) <5C ■ (16) 

Proof. By expanding the terms in the sum and using the translation invariance property E(Xf X^?) = 
E(Xq Xl_^) for all fc, j G {1, . . . , 1} and i < r, we get 

n— 1 -,71 — li—l -.n — li — 1 



I— i—l r— i—1 r— 

n—1 n—1 

= E ^^■) + - E - (^0 + - 5] (n - z)E (Xl X 

1=1 1=1 



Since ip^ G -Fe for each fc, the exponential cluster property of Gibbs measures (see Property 1.26 on Page 23 
in [13]) guarantees the existence of ^ < 00, and ^ G (0, 1) only depending on 9 and 5*, such that for all 
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k,j e{l,...,l} , 

As a consequence, ([T6| is satisfied. Hence all the series are absolutely convergent. Moreover, since J2tLo ^^{-^o-^i) 
is finite, the Cesaro mean of iK{XQXf) converges to zero and we obtain the formula 

oo oo 

^lim E(5^ Si)=E [X^ + ^ E [X^ X/) + ^ E (Xfj" X^) = E^^. . 

1=1 1=1 

Finally, we see from this explicit expression that the matrix Yi'^ is symmetric and semi-positive definite 
because each matrix (E(S','^5^) : k,j E {1, ...,/}) is a covariance matrix. Hence, its limit S"^ is also semi- 
positive definite. □ 

Next, we show a Central Limit Theorem for random vectors in the Gibbs framework, which is a corollary 
of the Central Limit Theorem given in |14j . 

Proposition 4. // E*^ = : fc,j = 1,...,/) given by i l73|) is positive definite then the vector 

process Zn (5^, . . . , ^'J (where is given in !114\)) converges in distribution to the multivariate normal 
vector 7V(0,E'^). 

Proof. We recall that the Central Limit Theorem shown in [M] says that if a function g G Fg is of zero mean 
and 

<j{g)' + where a{gf := X^^J^E^ goT^ j , 

^ I E ° -^(0' '^(5)') ■ (17) 

Since Fq is Banach, for all a — (ai, . . . ,ai) the function — Sfc=i Q!fc9''^ is in ^e- Since has zero 

mean and 'YTi=^k'^a ° T^)/V^ — Q^'^n, where a' denotes the transpose of a vector a, the Central Limit 
Theorem (fT7|) gives 

a'Z„ — ^AA(0,CT^), (18) 



then 



where 



al= lim iElfl^V^orA 



provided that ct^ 7^ 0. Now, cr^ 7^ will be obtained as a consequence of the fact that 

(72 =a'S'^a = ^^a,a,E'^(fc,j)- (19) 

fe=i j=i 

This together with the assumption that S*^ is positive definite allows us to determine that cr^ = a'S'^a > 
for all a 7^ 0. Furthermore, (fT5|) and (|19p yield 

Vs e M : lim E(e'''"'^") = e^^'^''^' = g-is^a's'-a ^ 
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which is the characteristic function of a A^(0, a'S'^a) random vector. Convergence of Z„ in distribution to 
an Af{0, S"^) random vector then follows from Levy's continuity theorem. 
It only remains to prove p^ . Notice that 

n— 1 I n—1 I n— 1 / 

i=0 k=l i=0 k=l i=0 fc=l 

which implies that 

al = hm E I lj2^kS!A ] =^^afca, Hm E{S^Si) . 

\ \fc=l / / fe=lj = l 

Finally, Proposition [3] asserts that lim„^oo ^(S'^S'^J — and hence the result follows. □ 

4 Testing under the Gibbsian assumption 

4.1 A statistical test 

Recall that the hypothesis Hq: CSPR for i? = 2, is equivalent to / = 0, where / was defined in ([T3l) . Let 
us introduce estimators of the various quantities involved in testing this. For any finite observed sequence 
X = {Xo,...,Xn-i), let 

?f r,,. Nr,{a,b) iV„(r(&),r(a)) 
n n 

where 

7V„(a, b) := #{fc e {0, . . . , n - 1} : {Xk,Xk+i} = (a, &)} 

counts the number of occurrences of the pattern ah in the sequence. Note that we treat the sequence X as 
though it were circular with Xn-i connected to Xq, so that Xn = Xq. 
We shall show that one appropriate statistic for assessing this test is 

where = ^/„(a, 6) : {a,b) e ICj is a consistent unbiased estimator of /'^ and Vn is a consistent biased 
estimator of the asymptotic covariance matrix V of y/nf^ which we shall define shortly. Furthermore, we 
shall prove that rjn converges asymptotically in distribution to a xi random variable. Then, sufficiently large 
values of fjn will identify sequences that fail to comply with CSPR for R = 2. 
More precisely, the test is set up as follows: 

Reject Hq if fjn > s, 

where s is some threshold to be chosen. If a is the type I error desired for the test (for instance a — 0.05 or 
0.01), then we require that 

P//„ (reject Hq) - P//„(?/„ > s) < a. 
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either exactly or asymptotically. Doing this exactly is not feasible in the current setting, but fj„ — - — > Xsj 
where — - — > denotes convergence in distribution. Thus the threshold s can be fixed asymptotically by 
appealing to the distribution on 5 degrees of freedom. We merely have to set s to the 1 — a quantile 
X5,i-a of the xl distribution. 



4.2 Asymptotics of the test statistic 

In order to construct this asymptotic test, we make the further assumption that the distribution P = ¥^j, 
is Gibbsian for some energy ^ G Fg, where 6 e (0, 1). Recall that P is ergodic. Let E denote the mean 
expected value operator associated with P. 
Firstly, 



From ergodicity the law of large numbers holds and so 

lim ^"^"'^^ = P([a6]o) P - a.e. and hence lim ^ = /'^ P - a.e. 

Therefore is a consistent, unbiased estimator of f'^. 
Next define 

^ = (V?"'" : (a, b) e A^) where ip^'" - P{[ab]o) , 

where, as usual. Is is the characteristic function of the set B. Observe that for all (a, 6) G we have 
if"''' £ Fg. For i > and n > 1, define 

n-l 



V (a, b)£A^ : = ^"'"^ o T and 5^^'' = ^ I] 



n 

i=0 



A simple calculation that takes advantage of the T-invariance of P gives 

- {l[abU - IP([«6]o)) and Sf-" = ^ (iV„(a, b) - nP([a6]o)) . 

y n 

A straight forward application of Proposition[3]can be used to show existence of the matrix E"^ = (S'^(a, 6; c, d) 
(a, 6), (c, d) e -A^), whose elements are defined by 

E'^(a,6;c,d) lim E(5;j''' 5-^) . 

n— >-oo 

(Note that for simplicity we write I]'^(a, 6; c, c?) rather than I]'^((a,6), (c, d))). Furthermore, using (fT5l) from 
the same lemma, we can see that 

oo oo 

E^(a, b- c, d) = E (X^'" Xl^'-'^ + ^ E (x^'" xf) + J] E (x^^^' , 

i=l i=l 

and that E"^ is symmetric and semi-positive definite. 
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Further simple computations enable us to write the elements of S'^ explicitly as 

OG 

T.'^{a,h-c,d) = P([a6]o n [cd]o) - P([a6]o)P(M]o) + ^ [P([afe]o n [cd],.) - P(H]o)P([cd]o)] 

k=l 

oo 

+ ^ [P([cd]o n [ab]k) - P([a6]o)P(M]o)] . 
fe=i 

As a corollary to Proposition HI we obtain: 



Proposition 5. Assume E"^ is positive definite. Then, the joint distribution of the counts iV„(a, &) 
asymptotically satisfy 



The following is then obtained by taking appropriate marginals in the preceding result. 
Corollary 6. Assume S"^ is positive definite. We have 

v^(i^-/'=)-^AA(0,F), 
where the covariance matrix V — {V{a,b]c,d) : (a, &), (c, c?) G /C) is given by 

V{a, b; c, d) = S'^(a, 6; c, d) + E^(r(6), r(a); r(d), r(c)) - S'^(r(6), r(a); c, d) - E'^(a, 6; r(d), r(c)) . 
From this result we conclude under the hypothesis Hq : /'^ = that 



As a consequence, nf^'V converges in distribution to a distribution on 5 degrees of freedom, 

provided that V is positive definite. 

Observe that /J = ^AA^^^^, where = (A^„(a, b) : (a, b) € A'^) and A (A(a, 6; c, d) : (a, 6) e /C, (c, d) € 
is the 5 X 16 matrix given by 

1, if (a, fo) = (c, d), 
A(a, 6; c, d) := <( -l, if (a, 6) = (r(d), r(c)), 
0, otherwise. 

The covariance matrix V may then be written as = AE'^'A'. Since A is of full rank, V is positive definite 
whenever T,'^ is positive definite. 

Proposition 7. Assume that E"^ is positive definite. Then, there exists a consistent estimator Vn of V 
such that rjn '■= ^fn'^n^fn convcrgcs in distribution to a xi random variable. 

The proof of this proposition will be a consequence of the following constructions and intermediate results. 
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In order to define the estimator Vn of tlie covariance matrix V, we first require an estimator of S*'. Let 
£n,m = (£n,m(a, b; c, d) : (a, b), (c, d) e A^^ , where 

, , _ Nj°\a,b;cJ) Nn{a,b) N^{c,d) 

^n,m\0-,0]C,a) :— • 

n n n 

/M'^(a,6;c,d) Nr^{a,b) Nn{c,d)\ 



»=i 



n n n 



and 



^ ( Nli\c,d;a,b) N^{a,b) jV„(c,d) \ 



iV«(a, 6; c, d) := #{j e {0, . . . , n - 1} : (X,-, = (a, 6, c, rf)} . 



Recall that we treat genome sequences as circular, so that Xn+i = Xi for i = 0, . . . , n — 1. 
Now, from the law of large numbers for Gibbs measures, 

lim ^"H^'^;^'^) = ^^ab]o n [cd]i) P - a.e. 

n—>oo fl 

and so 

lim i;„,m(a, 6; c, d) = Sf^-, (a, 6; c, d) P - a.e. 

n— >cx> v"-' 

where 

I]^„)(a, 6; c, d) = P(Ho n [cd]^) - P(Ho)P(Mo) + ^ [P(Ho n Mi) - P([a6]o)P(M]o)] 

i=l 

m 

+ ^ [P(Mo n [ab],) - P([a&]o)P(M]o)] . (21) 



However, 



S*'(a,6;c,d)= lim .(a, 6; c, d). (22) 



Now, we claim that there exists a sequence (m(n) : n > 1) which monotonically increases to oo such that 

S;n,m(n) («, ^1 c, d) )■ E'^ (a, 6; c, d) , 

P 

where !► is used to denote convergence in probability. To show this, first recall that convergence in 

n— >-oo 

probability is metrizable by some metric D, for instance, D{g,h) = ^{{\g — h\/{l + \g — h\)). Since 
lim DCEn m{o;b;c,d),T,f .{a,b;c,d)) = 0, we deduce that for all m > 1, there exists a positive integer 
N{m) satisfying 

Vn> N{m),yk€ {l,...,m} : D(S„,fc(a,6;c,d),S^^j(a,6;c,d)) < 1/m. 

The sequence {N{m) : m > 1) is increasing. Now for all n < N{1), we set m(n) = 1 and, for n > N{1), 
we define m(n) = sup{m : N{m) < n}. By construction m(n) increases with n. On the other hand, since 
m(n) > m for all n > N{m), we have lim m(n) = oo. By construction we have 

Vn> iV(l), : £»(I]„,„(„)(a,6;c,d),E^^(^)^(a,6;c,d)) < l/m(n) , 
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and the claim follows by letting m — >■ cx) and taking account of . 
Let Vn.rn = (Vn,m{a, b] c, d) : (a, &), (c, d) e ICj, where 

K,m(a, 6;c, d) := i;„,„i(a, 6; c, d) + E„^„i(r(fe), r(a); r(d), r(c)) - S]„,,„(r(6), r(a); c, d) - i;„,,„(a, 6; r(d), r(c)). 

Since S„,m(n) converges in probability to 'E'^, it follows that V^i := Ki.m(n) rnust converge in probability 
to V. Furthermore, will also converge to V^^ in probability, provided that V is positive definite. 

To summarize, we have shown that is a consistent (unbiased) estimator of /'^ while Vn constitutes a 
consistent (but biased) estimator of V. 

Next, we prove the asymptotic behavior of ry„ claimed in Proposition [T] Recall that V is positive definite 
since S''' is positive definite. We have shown that \/nf^ converges in distribution to Af{0,V) and V^^^ 
converges in probability to V^^. This implies that 

nf^'V-'f^ - nf^'V-'f^ = nf^' (y-^ - V'') jf 0. 

Combining this with the aforementioned fact that nf^'V^^f^ converges in distribution to a xi random 
variable, we see that 7y„ converges in distribution to a xi distribution as n — > oo and hence Proposition [7] 
has been proved. 



4.3 Practical considerations 



When computing the statistic rfn for a real genomic sequence, the parameter n is dictated by the length 
(in bases) of the genome under study. However, it is necessary to choose an appropriate value for the 
parameter m and this is not so straight forward. The regime (m(n)) derived in the previous subsection is 
not unique. In fact, the convergence results in the preceding subsection remain valid for any sequence that 
converges to oo more slowly than (m(n)). Consequently, any value m(n) <^ n should satisfy the consistency 
criterion. On the other hand, the exponential cluster property of Gibbs measures implies that terms of the 
series in (PT|) should tend geometrically toward zero and the same should also be true for terms of the series 
in (|20p when n is large. Eventually, there should come a point after which the terms of (pO|) will constitute 
noise of the estimators and these should be ignored. Consistency of the estimator Sn,m, together with the 
exponentially fast convergence of S^^^ to E*^, means that satisfactory results should be obtainable by setting 
m{n) small relative to n when computing rfn- 

In our implementation of this test of CSPR for dinucleotides, we chose m{n) to be the smallest value of i 
such that 



Ni'\a,b,a,b) /iV„(a,6) 



< 0.01 



iV„(a,&) fN„{a,b) 



0.01Var([a6]o) V(a, 6) G A"^. 



Here, Var([a5]o) denotes a consistent estimator of the variance of the frequency of the dinucleotide ab in 



a genome sequence. Since 



Nn{a,b) (Nn{a,b) 



is typically on the order of 0.06, we conjecture that 
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truncating the covariance estimators at the point where the sums composing the estimators change by less 
than 1% of Var([a6]o) is reasonable. 

Finally, some numerical experimentation leads us to conjecture that the test is highly powerful. Markov 
chains constitute a subset of the Gibbsian processes. Simulating genomic sequences from Markov chains 
which fail to comply with CSPR yields a rejection rate of 100% at the 5% significance level. Performing the 
same experiments on Markov chains that do satisfy CSPR results in a rejection rate close to the a chosen for 
the test, as one would expect. We obtained similar results for genomic sequences generated as realizations of 
Markov random fields. A Markov random field with maximal clique size k is equivalent to a Gibbs measure 
whose energy ^! takes the form 

k n— 1 

*(^) =Im^^''^(^^'•■•'^^+J-l)• 
In other words, the energy is a linear sum of functions '^'^^\ each of which depends on cliques of size j, that is, 
sets of i mutually neighboring sites. Simulations of such sequences can be produced using the Gibbs sampler 
and 5* will be F o I-invariant if ^^■'^ is F o I- invariant for all j = 1, . . . , k. In our numerical experiments, 
we simulated sequences from Markov random fields having maximal clique sizes of 3 and 4, using an energy 
which is not F o I-invariant. 

4.4 Application of the test 

Although successful tests of CSPR in genomic sequences have already been conducted using both empirical 
and rigorous methods (for instance, see [31 HI O [6]), we would like to test for CSPR for R = 2 under a 
Gibbsian hypothesis. 

We considered a set of 1049 complete bacterial genome sequences obtained from the GenBank 'genomes' 
repository. Length and GC-content statistics for the set of genomes are shown in Table [T] 

Table 1: Summary statistics for the lengths and GC-contents of a collection of 1049 complete bacterial 
genome sequence obtained from the GenBank repository. 



Property 


First Quartile 


Median 


Third Quartile 


Mean 


Std Deviation 


Length 


1906322 


2976212 


4603746 


3317355 


1759175 


GG-content 


0.3769 


0.4753 


0.6035 


0.4839 


0.1326 



To correct for multiple testing of a large number of genomes, we used the Holm- Bonferroni method, 
whose application posed no difficulties since p-values for the test statistic are readily obtainable from the 
xl distribution. Of the 1049 genomes tested at the a ~ 0.01 level of significance, the null was accepted in 
410 cases and was rejected in the remaining 639 genomes. We found no relationship between GG-content, 
genome length and rejection of the null. 
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Note that the Gibbsian assumption determines the form of the covariance matrix S*', which exists as 
a consequence of the exponential cluster property. Any genomic sequence that departs significantly from 
exponential clustering will give rise to an ffn far out in the tail of the xt distribution, since T,^ is likely to 
be near singular in such cases. A caveat with the test proposed here is that when a sequence is rejected, the 
reason for its rejection is unclear. Rejection could be due to either violation of CSPR or lack of compliance 
with the Gibbsian or translation invariance assumptions. In any case, further examination is warranted in 
order to discover why a particular sequence is rejected. On the other hand, given the test's apparently high 
sensitivity to departures from a Gibbsian structure, sequences for which the null hypothesis is accepted must 
comply much more closely to CSPR and exhibit a much stronger Gibbsian-like structure than those that are 
rejected. 
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